Abstract

We introduce a new approach for estimating the 3D pose and the 3D shape of
an object from a single image. Given a training set of view exemplars, we
learn and select appearance-based discriminative parts, which are mapped onto
the 3D model from the training set through a facility location optimization.
The training set of 3D models is summarized into a sparse set of shapes from
which we can generalize by linear combination. Given a test picture, we detect
hypotheses for each part. The main challenge is to select from these
hypotheses and compute the 3D pose and shape coefficients at the same time.
To achieve this, we optimize a function that jointly accounts for the
geometric reprojection error and the appearance matching of the parts. We
apply the alternating direction method of multipliers (ADMM) to minimize the
resulting convex function. We evaluate our approach on the Fine Grained 3D Car
dataset and obtain superior performance in terms of shape and pose errors. Our
main and novel contribution is the simultaneous solution for part
localization, 3D pose and shape by maximizing both geometric and appearance
compatibility.


1. Introduction 

Geometric features were the main representation in object recognition in the
20th century [11]. Images of 3D objects were usually assumed to be segmented
out, and correspondences between well-defined image features and projections
of vertices or edges were established through voting for geometric
consistency. Although such approaches were successful with geometric
invariance, they could not cope with the complexity of appearance of 3D
objects in the real world, which could only be learnt from exemplars. As soon
as such 2D image exemplars became available on the Internet and through
tedious annotation by the community, appearance-based approaches exploded, and
the computer vision community is proud of the state of the art in detecting
object categories via a bounding box or even segmenting them [7]. Pose
variation in 3D objects was converted into a 2D problem by clustering view
exemplars into different classes [15, ] ]. Recently, researchers have realized
that different views of the same 3D object can be married with existing 2D
approaches like the Deformable Part Models by either extending the pictorial
structures to three dimensions or rendering views of actual 3D CAD models
[23, 24, 37, 1' ].

We believe that there are three main challenges in the marriage of 2D
appearance and 3D geometry: (1) how to learn a representation that
efficiently predicts the appearance of geometric features given a pose and
shape, (2) how to optimize for appearance and correspondence compatibility as
well as 3D pose at the same time, without splitting the problem into
subproblems of discrete poses, and (3) how to establish the 3D shape of an
object when we want to avoid comparing serially with all possible 3D instances
or when that instance has not been seen before. While 2D pictorial structures
have been used to capture deformations for the sake of detection, we are
genuinely interested in establishing the actual 3D shape of an object for the
sake of fine-grained classification or 3D interaction such as grasping and
manipulation.

In this paper, we propose a novel approach that marries the power of
discriminative parts with an explicit 3D geometric representation, with the
goal of inferring the 3D pose as well as the 3D shape of an object from a
single image. We use discriminative learning to obtain part descriptors in 2D
training images enriched with the projection of a wireframe 3D model. Such
parts are centered around projections of 3D landmarks, which are given in
abundance on the 3D model. To establish a compact representation, we minimize
the number of needed landmarks by solving a facility location linear program,
where the selection simultaneously maximizes the discriminativeness of the
selected landmarks and the "serving" of the 3D landmarks that are left out.

Given a learned part model for each landmark, we detect the top hypotheses for
the location of each landmark in a test image. The challenge is how to best
fit these parts by maximizing the geometric consistency. This entails the
selection among the hypotheses of each part and the pose/shape computation.
Unlike other approaches, which initialize pose by detecting DPM-based
discretized poses [37, 21], we compute the selection as well as the 3D pose in
one step using a mean shape of the object category. We achieve this by
formulating the task as a convex optimization problem solvable by the
alternating direction method of multipliers (ADMM). Subsequently, we apply two
prunings of the hypotheses for each landmark projection. First, we prune by
the visibility induced by the estimated pose, and second, we prune by
proximity after solving for pose from the visible landmarks. One final
application of the ADMM optimization solves for the pose as well as the shape.
Joint pose and shape optimization is achieved by joining the coefficients with
respect to a sparse shape basis and the 3D rotation parameters into one matrix
variable for each shape basis.

Figure 1: Illustrative summary of our approach: 3D landmarks on a 3D model are
associated with discriminatively learned part descriptors (left). Intra-class
shape variation is captured with linear combinations of a sparse shape basis
(2nd from left). Learned part descriptors produce multiple maximum responses
for each part in a test image (3rd from left). The selection of the part
hypotheses, 3D pose and 3D shape are simultaneously estimated, and the result
is illustrated through a popup (right).

The main contributions of our approach are: 

• A compact learned representation of part descriptors corresponding to 3D
landmarks.

• Resolving the 2D-3D marriage by simultaneously optimizing for appearance
compatibility and geometric consistency. Unlike pairwise constraints in 2D
pictorial structures or graphical models, here structure is formulated as
global geometric consistency.

• Global geometric consistency does not mean rigidity: unlike RANSAC-based
approaches that localize 3D instances, we can deal with deformation by
estimating the coefficients with respect to a shape basis.

• Unlike approaches based on nonlinear minimization [15, 21, 3 ], we do not
need initial estimates and we do not get stuck in local minima.


Our paper follows a classic organization, starting with 
the related work (Sec. 2), the learning of the representation 
in Sec. 3, the inference in Sec. 4, and results in Sec. 5. 
Figure 1 illustrates the outline of our approach. 

2. Related Work 

Our model representation is inspired by recent advances in part-based
modeling [7, 28, 14, 17], which model the appearance of object classes with
mid-sized discriminative parts.

The most popular approach to 3D object detection and viewpoint classification
is to represent a 3D object by a collection of view-dependent 2D models
separately trained on discretized views. Examples of this approach include
[27, 31, 7, 30, 12, 24]. While these methods have shown superior detection
performance, they provide relatively weak information about the 3D geometry of
objects. Some recent works directly used 3D models to encode the geometric
relations among local parts and achieved continuous pose estimation
[34, 26, 19, 29, 10, 8, 33, 20, 23, 1], but they used either generic class
models or instance-based models. Our approach differs in that we not only
provide a detailed shape representation but also consider intra-class
variability.

The most closely related category of methods is based on a shape-space model
and tackles the recognition problem by aligning the shape model to image
features. This approach originated from the active shape model (ASM) [4],
which was originally used for segmentation and tracking based on low-level
image features. Cristinacce and Cootes [5] proposed the constrained local
models (CLM), which combined ASM with local appearance models for 2D feature
localization in face images. Gu and Kanade [13] presented a method to align 3D
deformable models to 2D images for 3D face alignment. Similar methods were
also proposed for 3D car modeling [15, 37, 21] and human pose estimation
[25, 35]. Our method differs in that we use a data-driven approach for
discriminative landmark selection and we solve landmark localization and shape
reconstruction in a single convex framework, which enables the problem to be
solved globally.

Our optimization approach is related to previous work on using convex
relaxation techniques for object matching, e.g. [22, 16, 18]. These methods
focused on finding the point-to-point correspondence between an object
template and a test image in 2D, while our method considers 3D-to-2D matching
as well as shape variability.

3. Shape Constrained Discriminative Parts 

Our proposed method models both the 2D appearance variation and the 3D shape
deformation of an object class. The 2D appearance is modeled as a collection
of discriminatively trained parts. Each part is associated with a 3D landmark
point on a deformable 3D shape.

Unlike previous works that manually define landmarks on the shape model, we
propose an automatic selection scheme: we first learn the appearance models
for all points on the 3D model, evaluate their detection performance, and
select a subset of them as our part models based on their detection
performance in 2D and their spatial coverage in 3D.

3.1. Learning Discriminative Parts 

One of the main challenges in object pose estimation arises from the fact
that, due to perspective transformation and self-occlusion, even the same 3D
position on an object has very different 2D appearances in the image when
observed from different viewpoints. We tackle this problem by learning a
mixture of discriminative part models for each point in the 3D model to
capture the variety in appearance.

Given a training set $D$, each training image $I_i \in D$ is associated with
the 3D points of the object shape $S_i \in \mathbb{R}^{3 \times p}$, their 2D
projections $L_i \in \mathbb{R}^{2 \times p}$ annotated in the image, and
their visibility $V_i \in \{0, 1\}^p$. For each visible 3D point
$j \in \{1, \dots, p\}$ in training image $I_i$, we extract an $N \times N$
image patch centered at its 2D location $L_{ij}$ as a positive example,
assuming all the images are resized such that the object is approximately at
the same scale. Negative examples are randomly extracted patches that do not
overlap with the object.

We bootstrap the learning of a discriminative mixture model for each part via
clustering. Recent work [14, 28] has shown that whitened HOG (WHO) achieves
better clustering results than HOG [6]. Denote $\phi(L_{ij})$ as the HOG
feature of the positive image patch centered at $L_{ij}$ and $\phi_{bg}$ as
the mean of background HOG features. We compute the WHO feature as
$\Sigma^{-1/2}(\phi(L_{ij}) - \phi_{bg})$, where $\Sigma$ is the shared
covariance matrix computed from all positive and negative features. Then we
cluster the WHO features of each part $j$ into $m$ clusters using K-means as
the initialization of the mixture model.

A linear classifier $W_{cj}$ is trained for each cluster $c$ of a part $j$. We
apply linear discriminant analysis due to its efficiency in training and
limited loss in detection accuracy [14, 9],

$$W_{cj} = \Sigma^{-1}\left(\bar{\phi}(L_{ij};\, z_{ij} = c) - \phi_{bg}\right), \tag{1}$$

where $z_{ij} \in \{1, \dots, m\}$ is the cluster assignment for each feature,
and $\bar{\phi}(L_{ij};\, z_{ij} = c)$ is the mean feature over all patches of
cluster $c$. Let $\mathbf{x} = (x, y)$ be the position in the image. The
response of part $j$ at a given location $\mathbf{x}$ is the maximum response
over all its components $c$:
$\mathrm{score}_j(\mathbf{x}) = \max_c \{W_{cj} \cdot \phi(\mathbf{x})\}$.
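The bootstrap step above (whitening, K-means initialization, and the LDA classifier of Eq. (1)) can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions: the function names `train_lda_mixture` and `part_score`, the ridge term added to the covariance, the plain Lloyd iterations, and the synthetic 4-dimensional "features" are all hypothetical and not taken from the paper, which operates on HOG features of image patches.

```python
import numpy as np

def train_lda_mixture(pos, neg, m=3, iters=20, ridge=1e-3, seed=0):
    """pos, neg: (n, d) feature arrays. Returns (W, z, phi_bg)."""
    rng = np.random.default_rng(seed)
    phi_bg = neg.mean(axis=0)                      # mean background feature
    feats = np.vstack([pos, neg])
    # Shared covariance over all features; the ridge is an added assumption
    # that keeps the matrix invertible.
    sigma = np.cov(feats, rowvar=False) + ridge * np.eye(pos.shape[1])
    evals, evecs = np.linalg.eigh(sigma)
    sigma_inv_sqrt = (evecs / np.sqrt(evals)) @ evecs.T   # Sigma^{-1/2}
    sigma_inv = (evecs / evals) @ evecs.T                 # Sigma^{-1}

    who = (pos - phi_bg) @ sigma_inv_sqrt          # whitened (WHO) features

    # Plain Lloyd iterations for K-means on the whitened features.
    centers = who[rng.choice(len(who), size=m, replace=False)].copy()
    for _ in range(iters):
        dist2 = ((who[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        z = dist2.argmin(axis=1)
        for c in range(m):
            if np.any(z == c):
                centers[c] = who[z == c].mean(axis=0)
    dist2 = ((who[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    z = dist2.argmin(axis=1)                       # final cluster assignment

    # Eq. (1): one LDA classifier per non-empty cluster.
    W = np.zeros((m, pos.shape[1]))
    for c in range(m):
        if np.any(z == c):
            W[c] = sigma_inv @ (pos[z == c].mean(axis=0) - phi_bg)
    return W, z, phi_bg

def part_score(W, feat):
    """Response of a part: maximum over its mixture components."""
    return float(np.max(W @ feat))

# Synthetic stand-in data: two positive appearance modes and a background.
rng = np.random.default_rng(1)
neg = rng.normal(0.0, 1.0, size=(200, 4))
pos = np.vstack([rng.normal([4, 0, 0, 0], 0.5, size=(50, 4)),
                 rng.normal([0, 4, 0, 0], 0.5, size=(50, 4))])
W, z, phi_bg = train_lda_mixture(pos, neg, m=2)
```

Because LDA only needs the shared covariance and per-cluster means, training every component reduces to a handful of matrix products, which is consistent with the efficiency argument made in the text.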

Due to intra-class variation and viewpoint differences, the appearance of
training patches may not be perfectly aligned. Such misalignment results in
inferior detector performance. We introduce a latent variable for each
training patch, $r_{ij} \in \mathbb{R}^2$, to represent the center location
relative to the annotated landmark location $L_{ij}$. To improve the
classifier, we use the classifiers learned from (1) to reposition the patch
center in the neighborhood $\Lambda(L_{ij})$ of $L_{ij}$ and update $z_{ij}$
and $r_{ij}$ as

$$x^*_{ij} = \arg\max_{\mathbf{x} \in \Lambda(L_{ij})} \mathrm{score}_j(\mathbf{x}),$$

$$z_{ij} = \arg\max_c \{W_{cj} \cdot \phi(x^*_{ij})\},$$

$$r_{ij} = x^*_{ij} - L_{ij}.$$

The classifier weights are then retrained as

$$W_{cj} = \Sigma^{-1}\left(\bar{\phi}(L_{ij} + r_{ij};\, z_{ij} = c) - \phi_{bg}\right), \tag{2}$$

where $\bar{\phi}(L_{ij} + r_{ij};\, z_{ij} = c)$ is the mean feature over the
aligned patches of cluster $c$. Note that the latent update procedure is
similar to that of DPM [7], with the difference that we do not apply the
generalized distance transform to filter responses but only consider maximum
responses within a local region. The reason is that our model, as will be
discussed in Section 3.3, is constrained by the 3D shape space instead of
learned 2D deformations. We thus want to obtain accurate part localization in
order to estimate the object pose and shape. Figure 2 shows an example
comparison of the mean image patch and the filter learned from clustering and
after several iterations of latent update. With the latent update step, the
image patches of the training set become better aligned, resulting in more
concentrated weights in the learned filter. After the latent update, each
patch mixture is retrained by hard negative mining and linear SVM as in [7] to
boost detection accuracy. A $2 \times 2$ covariance matrix $D_j$ is estimated
for each landmark $j$ from the latent variables $r_{ij}$ to model the
uncertainty of the detected landmark position $x^*_{ij}$ relative to the
ground truth.





Figure 2: The mean training patches and the positive weights of the learned
filter for a (view) component of a part on the car wheel. Filter learning is
bootstrapped by clustering. The top and bottom rows correspond to the results
before and after the latent update, respectively. The latent update procedure
updates the center position and scale for each training patch by detecting the
patches in the local region of the 2D landmark, and retrains the filters using
the better-aligned training patches. The procedure results in more
concentrated weights in the learned filters.



Figure 3: Visualization of the landmark selection optimization result. All 256
landmark points of a car are shown as circle markers. The color of a marker
represents the Average Precision (AP) of the landmark part detection on the
training set: red means higher AP and blue means lower AP. The size of a
marker represents the selection result: the larger ones are selected via the
MIP optimization and the smaller ones are not. The red landmarks are preferred
since they have higher detection accuracy, but only a subset of them is
selected because they are close to each other in 3D.


3.2. Selecting Discriminative Landmarks 

Seeking a compact representation of the object, we try to select only a small
subset of discriminative landmarks $S_D$ among all 3D landmarks $S$. We want
the selected landmarks $S_D$ both to be associated with discriminative part
models and to have a good spatial coverage of the object shape model in 3D.
The selection problem is formulated as a facility location problem,

$$\min_{y_u,\, x_{uv}} \; \sum_{u} z_u y_u + \lambda \sum_{u,v} d_{uv} x_{uv}, \tag{3}$$

$$\text{s.t.} \quad \sum_{v} x_{uv} = 1, \quad \forall u,$$

$$x_{uv} \le y_v, \quad \forall u, v,$$

$$x_{uv},\, y_u \in \{0, 1\}, \quad \forall u, v,$$

where the interpretation of each symbol is presented in Table 1.


Symbol       Interpretation
$z_u$        cost of selecting landmark $u$
$y_u$        binary landmark selection variable
$d_{uv}$     cost of landmark $v$ "serving" $u$
$x_{uv}$     binary variable for landmark $v$ "serving" $u$
$\lambda$    trade-off between unary costs and binary costs

Table 1: Interpretation of the notation in (3)


The cost $z_u$ for a landmark $u$ should be lower if the associated part model
is more discriminative. We model the discriminativeness by evaluating the
Average Precision (AP) of detecting each landmark in the training set. For any
landmark $u$, we perform detection with the learned part model in the training
set $D$ to generate a list of location hypotheses $H_u$. A hypothesis
$h \in H_u$ is considered a true positive if the ground truth location
$L_{iu}$ is within a small radius $\delta$. Let the computed AP for a part $u$
be $AP_u$; we set $z_u = 1 - AP_u$. The cost of "serving" (or suppressing)
another landmark is set to the Euclidean distance between the landmarks in 3D,
i.e., $d_{uv} = \|S_u - S_v\|_2$. The value of $\lambda$ is set to 1 in our
experiments. The minimization problem (3) is a Mixed Integer Programming (MIP)
problem, which is known to be NP-hard. However, a good approximate solution
can be obtained by relaxing the integer constraints to $x_{uv} \in [0, 1]$,
$y_u \in [0, 1]$, solving the relaxed Linear Programming problem, and
thresholding the solution. Figure 3 visualizes an example result of the MIP
optimization for landmark selection.
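To make the trade-off in (3) concrete, the following sketch evaluates the facility location objective on a toy instance and finds its exact minimizer by exhaustive search. Note that this is not the LP relaxation with thresholding described above; brute force is swapped in purely for illustration and is only feasible for a handful of landmarks. The helper names, the four toy 3D points, and the AP values are hypothetical. The sketch uses the fact that, for a fixed selection, each landmark is optimally served by its cheapest selected landmark.

```python
from itertools import combinations
import numpy as np

def facility_cost(selected, z, d, lam=1.0):
    """Objective of (3) for a fixed selection; d[u, v] is the cost of v serving u."""
    sel = list(selected)
    # Each landmark u is served by its nearest selected landmark v.
    return float(z[sel].sum() + lam * d[:, sel].min(axis=1).sum())

def select_landmarks_bruteforce(z, d, lam=1.0):
    """Exhaustive search over non-empty subsets (toy sizes only)."""
    n = len(z)
    best_cost, best_sel = np.inf, ()
    for size in range(1, n + 1):
        for sel in combinations(range(n), size):
            cost = facility_cost(sel, z, d, lam)
            if cost < best_cost:
                best_cost, best_sel = cost, sel
    return best_cost, best_sel

# Toy instance: two tight pairs of 3D landmarks, far apart from each other.
points = np.array([[0.0, 0, 0], [0.1, 0, 0], [5.0, 0, 0], [5.1, 0, 0]])
ap = np.array([0.9, 0.5, 0.8, 0.4])          # hypothetical detection APs
z = 1.0 - ap                                 # unary cost z_u = 1 - AP_u
d = np.linalg.norm(points[:, None] - points[None], axis=2)  # d_uv in 3D

best_cost, best_sel = select_landmarks_bruteforce(z, d, lam=1.0)
```

On this instance the search keeps the higher-AP landmark of each pair, mirroring the behavior described for Figure 3: discriminative landmarks are preferred, but nearby ones are redundant.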

3.3. 3D Shape Model 

We start our description by explaining how we would estimate the pose and
shape of an object if the 2D part to 3D landmark correspondences were known.
We represent a 3D object model as a linear combination of a few basis shapes
to constrain the shape variability. This assumption has been widely used in
various shape-related problems such as object segmentation [4], single
image-based shape recovery [1 ] and nonrigid structure from motion [3]. We use
a weak-perspective model, which is a good approximation when the depth of the
object is small relative to its distance from the camera. With these two
assumptions, the 2D part locations $P \in \mathbb{R}^{2 \times p}$ can be
described by

$$P = R \sum_{i=1}^{k} c_i B_i + t \mathbf{1}^T, \tag{4}$$

where $B_i \in \mathbb{R}^{3 \times p}$ denotes the $i$-th basis shape,
$R \in \mathbb{R}^{2 \times 3}$ represents the first two rows of a rotation
matrix, and $t \in \mathbb{R}^2$ is the translation vector. In model
inference, we try to minimize the geometric reprojection error to find the
optimal parameters.

However, the model in (4) is bilinear in $R$ and the $c_i$s, yielding a
nonconvex problem. In order to have a linear representation, we use the shape
model proposed in [3 ], which assumes that there is a rotation for each basis
shape. The 3D shape model is $S = \sum_{i=1}^{k} c_i R_i B_i$, and the 2D part
locations are given by

$$P = \sum_{i=1}^{k} T_i B_i + t \mathbf{1}^T, \tag{5}$$

where $T_i \in \mathbb{R}^{2 \times 3}$ corresponds to the first two rows of
$R_i$ multiplied by $c_i$. In order to enforce $T_i$ to be orthogonal, the
spectral norms of the $T_i$s are minimized during model inference. The
spectral norm is the largest singular value of a matrix, and minimizing it
enforces the two singular values to be equal, which yields an orthogonal
matrix [3' ]. After the $T_i$s are estimated, the $c_i$s and $R_i$s are
derived from the $T_i$s and the shape is reconstructed by
$S = \sum_{i=1}^{k} c_i R_i B_i$. Note that the reconstructed shape is in the
camera frame, and we compute a single rotation matrix $R$ by aligning the
reconstructed shape to the canonical pose.
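The text does not spell out how the coefficients and rotations are derived from the estimated matrices, so the following NumPy sketch shows one plausible route and should be read as an assumption rather than the authors' procedure: take the coefficient as the mean singular value of $T_i$, take the nearest matrix with orthonormal rows via the SVD, and complete the third row with a cross product. The function names are hypothetical, and the sketch assumes a nonnegative coefficient, since $T_i$ alone cannot distinguish $(c_i, R_i)$ from a sign-flipped pair.

```python
import numpy as np

def decompose_T(T):
    """Split a 2x3 matrix T ~ c * R[:2] into a scale c >= 0 and a 3x3 rotation R."""
    U, s, Vt = np.linalg.svd(T, full_matrices=False)   # U: 2x2, s: (2,), Vt: 2x3
    c = float(s.mean())                 # both singular values equal c if T is exact
    R2 = U @ Vt                         # nearest 2x3 matrix with orthonormal rows
    R = np.vstack([R2, np.cross(R2[0], R2[1])])        # third row completes R
    return c, R

def project(Ts, Bs, t):
    """Eq. (5): 2D part locations P = sum_i T_i B_i + t 1^T."""
    return sum(T @ B for T, B in zip(Ts, Bs)) + t[:, None]

def reconstruct_shape(Ts, Bs):
    """3D shape S = sum_i c_i R_i B_i in the camera frame."""
    return sum(c * (R @ B) for (c, R), B in zip(map(decompose_T, Ts), Bs))

# Toy check with a known rotation and coefficient.
a, b = np.deg2rad(30.0), np.deg2rad(20.0)
Rz = np.array([[np.cos(a), -np.sin(a), 0], [np.sin(a), np.cos(a), 0], [0, 0, 1]])
Rx = np.array([[1, 0, 0], [0, np.cos(b), -np.sin(b)], [0, np.sin(b), np.cos(b)]])
R_true, c_true = Rx @ Rz, 2.0
T = c_true * R_true[:2]
B = np.random.default_rng(0).normal(size=(3, 5))       # a stand-in basis shape
t = np.array([10.0, -3.0])

c_est, R_est = decompose_T(T)
P = project([T], [B], t)
S = reconstruct_shape([T], [B])
```

When the two singular values of an estimated $T_i$ are not exactly equal, the mean singular value and the SVD-based projection act as a least-squares style rounding to the nearest scaled rotation.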

4. Model Inference 

Finally, we obtain global geometry-constrained local-part models, in which the
unknowns are the 2D part locations as well as the 3D pose and shape. In model
inference, we maximize the detector responses over the part locations while
minimizing the geometric reprojection error.

4.1. Objective Function

We try to locate a part by finding its correspondence in a set of hypotheses
given by the trained detector. The cost without geometric constraints is

$$f_{score}(x_1, \dots, x_p) = -\sum_{j=1}^{p} r_j^T x_j, \tag{6}$$

where $x_j \in \{0, 1\}^l$ is the selection vector and $r_j \in \mathbb{R}^l$
is the vector of the detection scores for all hypotheses for the $j$-th part.

Geometric consistency is imposed by minimizing the following reprojection
error:

$$f_{geom}(x_1, \dots, x_p, T_1, \dots, T_k, t) = \frac{1}{2} \sum_{j=1}^{p} \left\| D_j^{-1/2} \left( L_j^T x_j - \left[ \sum_{i=1}^{k} T_i B_i \right]_j - t \right) \right\|_2^2, \tag{7}$$

where we concatenate the 2D locations of the hypotheses for part $j$ in
$L_j \in \mathbb{R}^{l \times 2}$, denote the covariance estimated in training
as $D_j$, and write $[\cdot]_j$ for the $j$-th column of a matrix.

As introduced in Section 3.3, we add the following regularizer to enforce the
orthogonality of $T_i$:

$$f_{reg}(T_1, \dots, T_k) = \sum_{i=1}^{k} \|T_i\|_2, \tag{8}$$

where we use $\|T_i\|_2$ to represent the spectral norm of $T_i$, i.e., the
largest singular value.

To simplify the computation, we relax the binary constraint on $x_j$ and allow
it to be a soft-assignment vector $x_j \in \Delta$, where
$\Delta = \{x \in [0, 1]^l \mid \sum_{i=1}^{l} x_i = 1\}$.

Finally, the objective function reads

$$\min_{X, T, t} \; f_{geom}(X, T, t) + \lambda_1 f_{score}(X) + \lambda_2 f_{reg}(T), \tag{9}$$

$$\text{s.t.} \quad x_j \in \Delta, \quad \forall j = 1 : p,$$

where $X$ and $T$ represent the unions of $x_1, \dots, x_p$ and
$T_1, \dots, T_k$, respectively. After solving (9), we recover the 3D shape
$S$ and pose $\theta = (R, t)$ from the $T_i$s, as introduced in Section 3.3.

4.2. Optimization

The problem in (9) is convex since $f_{score}$ is a linear term, $f_{geom}$ is
the sum of squares of linear terms, and $f_{reg}$ is the sum of norms of
unknown variables. We use the alternating direction method of multipliers
(ADMM) [2] to solve the convex problem in (9). Since $f_{reg}$ is
nondifferentiable and therefore not straightforward to optimize, we introduce
an auxiliary variable $Z$ and reformulate the problem as follows:

$$\min_{X, T, t, Z} \; f_{geom}(X, T, t) + \lambda_1 f_{score}(X) + \lambda_2 f_{reg}(Z), \tag{10}$$

$$\text{s.t.} \quad T = Z,$$

$$x_j \in \Delta, \quad \forall j = 1 : p.$$

The corresponding augmented Lagrangian is:

$$\mathcal{L} = f_{geom}(X, T, t) + \lambda_1 f_{score}(X) + \lambda_2 f_{reg}(Z) + \langle Y, T - Z \rangle + \frac{\rho}{2} \|T - Z\|_F^2. \tag{11}$$

The ADMM algorithm iteratively updates the variables by the following steps to
find the stationary point of (11):

$$t \leftarrow \arg\min_{t} \mathcal{L}, \tag{12}$$

$$X \leftarrow \arg\min_{X} \mathcal{L}, \tag{13}$$

$$T \leftarrow \arg\min_{T} \mathcal{L}, \tag{14}$$

$$Z \leftarrow \arg\min_{Z} \mathcal{L}, \tag{15}$$

$$Y \leftarrow Y + \rho (T - Z). \tag{16}$$

It can be shown that (12), (13) and (14) are all quadratic programming
problems, which have closed-form solutions or can be solved efficiently using
existing convex solvers. (15) is a spectral-norm regularized proximal problem,
which also admits a closed-form solution [36].
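For intuition on the step in (15), a proximal operator of the spectral norm can be sketched via the Moreau decomposition: subtract from the input its projection onto the nuclear-norm ball, which acts on the singular values as a projection onto an $\ell_1$ ball. Whether this matches the exact closed form used in [36] is an assumption; the function names below are hypothetical. Under the augmented Lagrangian in (11), completing the square suggests that each block would be updated as the proximal operator with threshold $\lambda_2 / \rho$ applied to $T_i + Y_i / \rho$, which is again a derivation sketch rather than a statement from the source.

```python
import numpy as np

def project_l1_ball(v, radius):
    """Project a nonnegative vector v onto {w >= 0 : sum(w) <= radius}, radius > 0."""
    if v.sum() <= radius:
        return v.copy()
    u = np.sort(v)[::-1]                          # sort in decreasing order
    css = np.cumsum(u)
    ks = np.arange(1, len(u) + 1)
    rho = np.nonzero(u - (css - radius) / ks > 0)[0][-1]
    theta = (css[rho] - radius) / (rho + 1)       # soft-threshold level
    return np.maximum(v - theta, 0.0)

def prox_spectral_norm(X, tau):
    """argmin_Z  tau * ||Z||_2 + 0.5 * ||Z - X||_F^2  (||.||_2 = spectral norm)."""
    if tau <= 0:
        return X.copy()
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # Moreau decomposition: prox = identity - projection onto the dual-norm ball.
    s_new = s - project_l1_ball(s, tau)
    return (U * s_new) @ Vt

# Small checks on 2x3 matrices, the shape of each T_i.
Z1 = prox_spectral_norm(np.array([[3.0, 0, 0], [0, 1.0, 0]]), 1.0)
Z2 = prox_spectral_norm(np.array([[3.0, 0, 0], [0, 2.5, 0]]), 1.0)
Z3 = prox_spectral_norm(np.array([[0.3, 0, 0], [0, 0.2, 0]]), 1.0)
```

In the second check both singular values end up equal, which illustrates the claim in Section 3.3 that penalizing the spectral norm pushes the two singular values of $T_i$ toward each other.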

4.3. Visibility Estimation 

In model inference, only visible landmarks should be considered. To estimate
the unknown visibility, we adopt the following strategy. We first assume that
all landmarks are visible and solve our model in (9) to obtain a rough
estimate of the viewpoint. Since the landmark visibility of a car only depends
on the aspect graph, the roughly estimated viewpoint can give us a good
estimate of the landmark visibility. We observed that our model could reliably
estimate the coarse view by assuming full visibility, which might be
attributed to the global optimization. After obtaining the visibility, we
solve our model again by only considering the visible landmarks. The full
shape can be reconstructed by the linear combination of the full meshes of the
basis shapes after the coefficients are estimated.

4.4. Successive Refinement 

The relaxation of the binary selection vectors $x_j$ in (9) may yield
inaccurate localization, since it allows the landmark to be located anywhere
inside the convex hull of the hypotheses. To improve the precision, we apply
the following scheme: we solve our model in (9) repeatedly, and in each
iteration we define a trust region based on the previous result for each
landmark and keep only the hypotheses inside the trust region as the input to
fit the model again. We use three iterations. We start from a large trust
region to achieve global fitting and gradually decrease the trust region size
in each iteration to reject outliers and improve localization. This successive
refinement scheme has been widely used for feature matching [18, 16].

We summarize the inference process in Algorithm 1. 

5. Experiments 

In this section, we evaluate our method (PSP) in terms of both pose and shape
estimation accuracy. The experiments are carried out on the Fine Grained 3D
Car dataset (FG3DCar) [21], since it is the only dataset with both landmark
projections in the image and pose annotations for 3D objects. The dataset
consists of 300 images with 30 different car models of 6 car types under
different viewing angles. Each car instance is associated with a shape model
of 256 3D landmark points and their projected 2D locations annotated in the
image, as well as a 3D pose annotation. We perform the following evaluations.
First, we compare the accuracy of pose and shape estimation to the iterative
model fitting method of [21] (FG3D) in terms of 2D landmark projection error.
Second, we compare the coarse viewpoint estimation error to viewpoint-DPM
(V-DPM) [12, 32]. In addition, since our viewpoint estimation is continuous,
we also show the angular errors compared to the ground-truth annotation.
Throughout the experiments, we follow the same training-testing split as
[2 ]: half of the images are used for training and half for testing. We
increase the training set size by flipping the training images left-right and
using symmetry to flip the landmark visibility labels.

Algorithm 1: Outline of the inference process

Input: 2D hypotheses for parts $\{H_i = (L_i, r_i) \mid i = 1 : m\}$;
       3D basis shapes $\{B_i \mid i = 1 : k\}$ and mean shape $B_0$
Output: Estimated pose $\theta = (R, t)$ and shape $S$

/* optimize($\{H_i\}$, $\{B_i\}$) means solving (9) with hypotheses $\{H_i\}$
   and basis shapes $\{B_i\}$ to recover $\theta$ and $S$ (Section 4.1) */

/* Estimate visibility (Section 4.3) */
$\theta, S \leftarrow$ optimize($\{H_i \mid i = 1 : m\}$, $B_0$);
$V \leftarrow$ visibility($\theta$);

/* Prune hypotheses (Section 4.4) */
$\theta, S \leftarrow$ optimize($\{H_i \mid i \in V\}$, $B_0$);
$H_i \leftarrow$ pruning($H_i$, $S_i$), $\forall i \in V$;

/* Refine pose & shape with shape space */
$\theta, S \leftarrow$ optimize($\{H_i \mid i \in V\}$, $\{B_i \mid i = 1 : k\}$);

During training, we learn a mixture of discriminative part models with three
components for each of the 256 landmark points, as described in Section 3. The
Average Precision (AP) of the landmark detection is evaluated on the training
set. We count a detection as a true positive only if the detected landmark
location is within 20 pixels of the annotated location; otherwise it is
counted as a false positive. We optimize the landmark selection with the unary
cost set to 1 - AP of each landmark and the pairwise cost set to the average
pairwise 3D distance over all the 3D models in the training set. The MIP
optimization selects 52 of the 256 landmark points. To build the shape models,
we learned a dictionary consisting of 10 basis shapes from the 3D models
provided in the FG3DCar dataset. We use $\lambda_1 = 1000$ and
$\lambda_2 = 50$ in (9) during inference.

Note that, unlike FG3D, our method does not need an external object detector
to initialize either the location and scale in the image or coarse landmark
locations. We perform pose and shape estimation on the original image with
background clutter.

Method              meanAPD (SL)    meanAPD
PSP Mean shape      16.5            20.6
PSP Class mean      15.4            18.9
PSP Shape space     14.6            17.7
FG3D Class mean     -               18.1
FG3D Shape space    -               20.3

Table 2: Model fitting error of PSP versus FG3D in terms of mean APD in pixels
evaluated on the 52 selected discriminative landmarks (SL) and the 64
landmarks provided in the dataset.

Figure 4: Car type specific mean APD of PSP versus FG3D with mean prior and
class prior. Compared to the FG3D method, our method achieves lower meanAPD on
most car types. For the pickup truck type, our method significantly
outperforms FG3D.

            Accuracy
Method      40° per view    20° per view
V-DPM       82.7%           71.3%
PSP         89.3%           84.7%

Table 3: Coarse viewpoint estimation accuracy versus V-DPM evaluated on the
FG3DCar dataset. Accuracies are compared with two discretization schemes, 20
degrees per coarse viewpoint and 40 degrees per coarse viewpoint.

Figure 5: Continuous viewpoint (azimuth) error compared to the ground truth on
all 150 test images in the FG3DCar dataset. The mean error is 3.4 degrees.


3D Pose and Shape Estimation. Pose and shape estimation accuracy is evaluated
in terms of meanAPD, which is the average landmark projection error in pixels
over the landmarks and the test instances. In the following experiments, we
investigate the effect of using different 3D shapes on the model fitting
error. We compare three setups with different basis shapes: mean shape, mean
class shape and shape space (10 basis shapes). The middle column of Table 2
shows the fitting error on the selected discriminative landmarks. The fitting
error decreases when we use the shape space instead of either the mean shape
or the class mean, which validates the use of the shape space to express
intra-class shape variation.
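The meanAPD metric described above amounts to a mean Euclidean distance over landmarks and test instances. A minimal NumPy sketch follows; the function name, the array layout `(n_images, n_landmarks, 2)`, and the optional visibility mask are assumptions made for illustration, since the text does not state whether occluded landmarks are excluded.

```python
import numpy as np

def mean_apd(pred, gt, visible=None):
    """Mean landmark projection error in pixels.

    pred, gt: arrays of shape (n_images, n_landmarks, 2) with 2D locations.
    visible:  optional boolean mask of shape (n_images, n_landmarks);
              restricting to visible landmarks is a hypothetical option.
    """
    dist = np.linalg.norm(pred - gt, axis=-1)     # per-landmark pixel error
    if visible is not None:
        dist = dist[visible]
    return float(dist.mean())

# Toy example: every predicted landmark is off by (3, 4) pixels.
gt = np.zeros((2, 5, 2))
pred = gt + np.array([3.0, 4.0])
err = mean_apd(pred, gt)
```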

Since the selected discriminative landmarks are not identical to the landmarks
provided in the FG3DCar dataset, we also compare the meanAPD on the landmarks
provided in the dataset. Using the shape space, our method outperforms FG3D
without knowing the class type. Note that their detectors are trained on the
64 manually selected landmarks provided in the dataset, while our detectors
are trained on the 52 automatically selected discriminative landmarks.

Although our objective is to optimize the projection error on the
discriminative landmarks, the fitting error on the landmarks provided in the
dataset is also reduced. This shows the effectiveness of the landmark
selection process. The error is reported on the same scale as FG3D. Figure 4
shows the per-class 3D model fitting error. Our method outperforms FG3D on
most class types, with particular success on the pickup trucks.


Viewpoint Estimation. We compare PSP to V-DPM in discrete viewpoint estimation
accuracy. For V-DPM, we train two sets of baseline models with coarse
viewpoints (azimuth) of 20 degrees and 40 degrees per view. Each component of
V-DPM corresponds to a viewpoint label. During inference, the viewpoint of the
test car instance is predicted as the training viewpoint of the max-scoring
component. For PSP, the estimated continuous viewpoint is discretized in the
same way as for V-DPM. Table 3 shows the comparison of the two methods. In
both cases, PSP outperforms V-DPM. We further analyze the estimation error of
PSP by looking at the continuous viewpoint estimation error and show that the
majority of the error is introduced by discretization. We compare our
estimation to the ground-truth viewpoint (azimuth) and report the absolute
angular error in Figure 5. The mean error over the whole test set is only 3.4
degrees.
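The effect of discretization can be illustrated with a short sketch that bins continuous azimuth estimates and also measures the wrap-around angular error. The function names, the bin alignment (bins starting at 0 degrees), and the toy angles are assumptions; the text does not specify how the bins are aligned.

```python
import numpy as np

def angular_error(a, b):
    """Absolute azimuth difference in degrees, accounting for wrap-around."""
    return abs((a - b + 180.0) % 360.0 - 180.0)

def viewpoint_bin(azimuth, bin_width):
    """Index of the coarse viewpoint bin; bins are assumed to start at 0 degrees."""
    return int((azimuth % 360.0) // bin_width)

def coarse_accuracy(pred, gt, bin_width):
    """Fraction of instances whose predicted and true azimuths share a bin."""
    hits = [viewpoint_bin(p, bin_width) == viewpoint_bin(g, bin_width)
            for p, g in zip(pred, gt)]
    return float(np.mean(hits))

# Toy azimuths (degrees): the second pair differs by only 2 degrees but
# straddles a bin boundary at 80 degrees.
pred = [10.0, 79.0, 200.0]
gt = [15.0, 81.0, 201.0]
acc40 = coarse_accuracy(pred, gt, 40.0)
mean_err = float(np.mean([angular_error(p, g) for p, g in zip(pred, gt)]))
```

In this toy case a 2-degree error is counted as a miss purely because of where the bin boundary falls, which is the kind of discretization effect the paragraph above refers to.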

In addition to the quantitative evaluations, we show qualitative results on
the test images from FG3DCar in Figure 6, where we project the 3D model
wireframe with the estimated pose and shape onto the image. We also show the
textured model rendered at novel views.






























Figure 6: On the first two rows, the 3D wireframe of the car model is projected on the image with the estimated pose and shape.
Red solid lines represent visible wireframes and blue dotted lines represent invisible wireframes. Our method robustly
estimates the pose and shape for all car types and different view angles. On the last row, the textured 3D reconstructions of
the cars on the fourth row are rendered at novel viewpoints. (We use symmetry to texture the invisible faces.)

6. Conclusion

We proposed a novel approach for estimating the pose and the shape of a 3D
object from a single image. Our approach is based on a collection of
automatically-selected and discriminatively-trained 2D parts with a 3D
shape-space model to represent the geometric relation. In model inference, we
simultaneously localized the parts, estimated the pose, and recovered the 3D
shape by solving a convex program with ADMM.